Cluster-based language model for spoken document retrieval using NMF-based document clustering

Authors

Xinhui Hu

Ryosuke Isotani

Hisashi Kawai

Satoshi Nakamura

Abstract

In this paper, a non-negative matrix factorization (NMF)-based document clustering approach is proposed for the cluster-based language model for spoken document retrieval. The retrieval language model comprises three different unigram models: a whole-corpus collection-based unigram, a document-based unigram, and a document clustering-based unigram. They are combined with double linear interpolation. Document clustering is realized via the NMF method: each document is assigned to the axis on which it has the maximum projection in the latent semantic space derived by NMF. The initialization of NMF, which is an important factor influencing NMF performance, is based on the clustering results of the K-means approach. Using these approaches, retrieval experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ). The proposed method significantly outperforms the conventional vector space model (VSM): the maximum improvement in retrieval performance (mean average precision, MAP) exceeds 36%, compared with an improvement of 7.4% for the conventional query likelihood model. The proposed method also surpasses the K-means clustering method when adequate initialization of NMF is used.
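The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation formula, so the form below (document unigram interpolated with a background that is itself an interpolation of the cluster and corpus unigrams) is an assumption, as are the function names and the weights `lam` and `beta`. The NMF step itself is not implemented here; `assign_clusters` only shows the "axis of maximum projection" rule applied to a document-by-axis weight matrix that an NMF routine would supply.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def assign_clusters(doc_weights):
    """Assign each document to the latent axis where its weight is largest.

    doc_weights[i][k] is the non-negative weight of document i on axis k,
    e.g. one row of a document-by-axis matrix produced by NMF (assumed input).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in doc_weights]


def query_log_likelihood(query, doc, cluster_docs, corpus_docs,
                         lam=0.5, beta=0.5):
    """Score a query against one document with doubly interpolated unigrams.

    Assumed form (not taken from the paper):
      P(w|d) = lam * P_doc(w)
             + (1 - lam) * (beta * P_cluster(w) + (1 - beta) * P_corpus(w))
    """
    p_doc = unigram(doc)
    p_cluster = unigram([w for d in cluster_docs for w in d])
    p_corpus = unigram([w for d in corpus_docs for w in d])
    score = 0.0
    for w in query:
        background = (beta * p_cluster.get(w, 0.0)
                      + (1 - beta) * p_corpus.get(w, 0.0))
        p = lam * p_doc.get(w, 0.0) + (1 - lam) * background
        if p == 0.0:
            # Word unseen in the whole corpus: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score
```

In this sketch a query word that is missing from a document still receives probability mass from the document's cluster and from the corpus, which is the usual motivation for cluster-based smoothing; the values of `lam` and `beta` would need to be tuned on held-out data.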

Similar resources

Ping-pong Document Clustering using NMF and Linkage-Based Refinement

This paper proposes a ping-pong document clustering method that alternates NMF and linkage-based refinement in order to improve the clustering result of NMF. The use of NMF in the ping-pong strategy can be expected to be effective for document clustering. However, NMF in the ping-pong strategy often worsens performance because NMF often fails to improve the clustering result given as the initi...

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Document Clustering using Weighting and Labels based on Inherent Structure of Document

In classic document clustering, documents are represented by term frequencies without considering the semantic information of each document (i.e., the vector model). This property of the vector model may cause documents to be incorrectly classified into different clusters when documents of the same cluster lack shared terms. Recently, knowledge-based approaches have been used to overcome this problem. However, these approaches have an...

Publication date: 2010
